---
title: "Homework 4"
author: "Lindsay Jones"
description: The fourth homework
date: "11/14/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw4
---
# Homework 4
## Setup
```{r}
library(alr4) #provides the salary data used in Question 2
library(smss) #provides the house.selling.price data used in Question 3
```
## Question 1
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x₁ = size of home (in square feet), and x₂ = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x₁ + 2.84x₂.
### A
A particular home of 1240 square feet on a lot of 18,000 square feet sold for \$145,000. Find the predicted selling price and the residual, and interpret.
```{r}
#plug the given values into the prediction equation
x1 <- 1240   #home size (square feet)
x2 <- 18000  #lot size (square feet)
y <- 145000  #actual selling price
yhat1 <- -10536 + 53.8*x1 + 2.84*x2
yhat1
```
```{r}
#subtract predicted from actual to find residual
y - yhat1
```
The predicted selling price is \$107,296, and the residual is \$37,704; the house sold for \$37,704 more than the equation predicts.
### B
For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?
```{r}
#repeat the prediction with home size increased by one square foot
x1f <- 1241  #home size + 1 square foot
x2f <- 18000 #lot size unchanged
yhat1f <- -10536 + 53.8*x1f + 2.84*x2f
#subtract the old predicted price from the new one
yhat1f - yhat1
```
In the given equation ŷ = −10,536 + 53.8x₁ + 2.84x₂, the slope coefficient for x₁ (home size) is 53.8, meaning that for each one-square-foot increase in home size, with lot size held fixed, the predicted selling price increases by \$53.80. The code above demonstrates this: it takes the part A prediction, increases home size by one square foot, and subtracts the part A prediction from the new one.
### C
According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?
```{r}
#lot-size increase equivalent to one extra square foot of home size
53.8/2.84
```
The slope coefficient for x₂ (lot size) is 2.84, meaning that for each one-square-foot increase in lot size, with home size held fixed, the predicted selling price increases by \$2.84. Setting 2.84x₂ = 53.8 and solving gives x₂ = 53.8/2.84 ≈ 18.94, so lot size would need to increase by about 18.94 square feet to match the impact of a one-square-foot increase in home size.
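As a quick check (a sketch reusing the hand-coded equation and objects from part A), increasing the lot size in that prediction by 53.8/2.84 square feet should raise the predicted price by the same \$53.80 as one extra square foot of home size:

```{r}
#prediction with lot size increased by 53.8/2.84 square feet
yhat_lot <- -10536 + 53.8*x1 + 2.84*(x2 + 53.8/2.84)
#difference from the part A prediction should be $53.80
yhat_lot - yhat1
```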
## Question 2
The data file concerns salary and other characteristics of all faculty in a small Midwestern college, collected in the early 1980s for presentation in legal proceedings in which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure-track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and Masters; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; year, years in current rank; ysdeg, years since highest degree; and salary, academic year salary in dollars.
```{r}
head(salary)
```
### A
Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.
The null hypothesis is that the mean salary for men equals the mean salary for women, without regard to any other variable. The alternative hypothesis is that the two means are not equal.
```{r}
#two-sample t-test of salary by sex, assuming equal variances
t.test(salary ~ sex, data = salary, var.equal = TRUE)
```
The sample means differ (about \$24,697 for men versus \$21,357 for women), but the p-value (0.0706) is greater than .05, so we fail to reject the null hypothesis that the population mean salaries are equal.
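For reference, the group means underlying the test can be pulled directly (a minimal base-R sketch):

```{r}
#mean salary by sex; matches the t-test's sample estimates
tapply(salary$salary, salary$sex, mean)
```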
### B
Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
```{r}
#run multiple linear regression
fit <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
summary(fit)
```
```{r}
#obtain a 95% confidence interval
confint(fit, "sexFemale")
```
The confidence interval suggests that, holding the other predictors constant, a female faculty member may earn anywhere from about \$697.82 less to about \$3,030.57 more than a comparable male; because the interval contains zero, the difference is not statistically significant.
### C
Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables.
```{r}
summary(fit)
```
The p-values show that the only statistically significant predictors are rank (both Assoc and Prof, relative to the Asst baseline) and year: each additional year in rank is associated with about \$476 more salary, and associate and full professors earn about \$5,292 and \$11,119 more than assistant professors, respectively, holding the other predictors constant. The Estimate column gives the coefficient/slope of each variable: holding a PhD and being female are associated with higher salary, but neither coefficient is statistically significant. Years since highest degree (ysdeg) is associated with a decrease in salary, though again not significantly.
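Confidence intervals tell the same story: the intervals that contain zero correspond to the non-significant predictors above (a quick check on the part B model):

```{r}
#95% confidence intervals for all coefficients in the part B model
confint(fit)
```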
### D
Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
```{r}
#make "Prof" the baseline level for rank
salary$rank2 <- relevel(salary$rank, ref = "Prof")
fit2 <- lm(salary ~ degree + rank2 + sex + year + ysdeg, data = salary)
summary(fit2)
```
With "Prof" as the baseline category, the rank coefficients flip sign: assistant and associate professors are estimated to earn about \$11,119 and \$5,826 less, respectively, than full professors, holding the other predictors constant. The fitted model is unchanged; the coefficients are simply expressed relative to a different reference rank.
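The two parameterizations are consistent; for example, the rank2Assoc estimate is just the difference of the part B rank coefficients (a quick check):

```{r}
#Assoc relative to Prof = rankAssoc - rankProf from the part B model
coef(fit)["rankAssoc"] - coef(fit)["rankProf"]
```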
### E
Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’ ” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
```{r}
#refit without rank
fit3 <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(fit3)
```
Excluding rank flips the sign of the coefficients for holding a PhD, being female, and years since highest degree: PhD and female are now associated with lower salary, while ysdeg is now associated with higher salary and becomes statistically significant. This suggests that much of what those variables appeared to contribute before was being absorbed by rank.
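Because the part E model is nested in the part B model, an F-test quantifies how much explanatory power rank carries (a sketch assuming fit and fit3 are still in memory):

```{r}
#nested-model F-test for the contribution of rank
anova(fit3, fit)
```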
### F
Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.
Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher salaries than those that were not?
```{r}
#dummy variable: 1 = hired by the new Dean (highest degree within the last 15 years)
salary$hired <- ifelse(salary$ysdeg <= 15, 1, 0)
dean <- lm(salary ~ hired + rank + sex + degree, data = salary)
summary(dean)
```
I removed the variable ysdeg because the dummy variable is defined directly from it (hired = 1 exactly when ysdeg ≤ 15), so the two would be highly correlated (collinear) and the model could not separate their effects. In this model, those hired by the new Dean are estimated to make \$319 more than those hired by the previous Dean, but the coefficient is not statistically significant (p ≈ 0.81), so the evidence for the hypothesis is weak.
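To see why multicollinearity was a concern, hired is a deterministic function of ysdeg, so the two are strongly (negatively) correlated (a quick check, nothing more):

```{r}
#hired is defined from ysdeg, so the two are strongly correlated
cor(salary$hired, salary$ysdeg)
```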
## Question 3
### A
Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable, discuss statistical significance and interpret the meaning of the coefficient.
```{r}
data("house.selling.price")
summary(lm(Price ~ Size + New, data = house.selling.price))
```
Size and newness are both statistically significant predictors (p < .01 for each). The coefficient for Size indicates that, controlling for newness, the predicted price increases by about \$116.13 for each additional square foot. The coefficient for New indicates that, controlling for size, a new home is predicted to sell for about \$57,736.28 more than an otherwise comparable older home.
### B
Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.
The prediction equation is ŷ = -40230.867 + 116.132x + 57736.283z, where x is the size of the home (in square feet) and z = 1 if the home is new, 0 if not.
Old (z = 0): ŷ = -40230.867 + 116.132x
New (z = 1): ŷ = (-40230.867 + 57736.283) + 116.132x = 17505.416 + 116.132x
### C
Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
```{r}
#not new (z = 0)
-40230.867 + 116.132*3000
#new (z = 1)
17505.416 + 116.132*3000
```
A 3,000-square-foot home is predicted to sell for about \$308,165 if it is not new and about \$365,901 if it is new.
### D
Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results
```{r}
#refit with an interaction between size and newness
newfit <- lm(Price ~ Size * New, data = house.selling.price)
summary(newfit)
```
The interaction between size and newness is statistically significant (p = .005): each additional square foot is estimated to add about \$61.92 more to the price of a new home than to the price of an older one.
### E
Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.
Old (z = 0): ŷ = -22227.808 + 104.438x
New (z = 1): ŷ = (-22227.808 - 78527.502) + (104.438 + 61.916)x = -100755.31 + 166.354x
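These lines can also be read directly off the fitted model; a minimal sketch building each intercept and slope from coef(newfit):

```{r}
b <- coef(newfit)
#not new (New = 0): base intercept and slope
c(intercept = unname(b["(Intercept)"]), slope = unname(b["Size"]))
#new (New = 1): add the New shift to the intercept and the interaction to the slope
c(intercept = unname(b["(Intercept)"] + b["New"]), slope = unname(b["Size"] + b["Size:New"]))
```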
### F
Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
#### i
```{r}
#new
-22227.808 + 104.438*3000 - 78527.502 + 61.916*3000
```
#### ii
```{r}
#old
-22227.808 + 104.438*3000
```
So a new 3,000-square-foot home is predicted to sell for about \$398,307, versus about \$291,086 if it is not new.
### G
Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.
#### i
```{r}
#new
-22227.808 + 104.438*1500 - 78527.502 + 61.916*1500
```
#### ii
```{r}
#old
-22227.808 + 104.438*1500
```
At 1,500 square feet the new home is predicted to sell for about \$14,346 more than the not-new home (\$148,776 versus \$134,429), while at 3,000 square feet the gap is about \$107,220. Because of the interaction term, the predicted price difference between new and not-new homes widens as home size increases.
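Equivalently, the new-minus-old gap at size x is -78527.502 + 61.916x, so it grows by about \$61.92 per square foot; a small illustrative helper (built from the coefficients reported above) makes the comparison explicit:

```{r}
#illustrative helper: predicted price gap (new minus not new) at a given size
gap <- function(size) -78527.502 + 61.916*size
gap(1500)
gap(3000)
```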
### H
Do you think the model with interaction or the one without it better represents the relationship of size and new to the outcome price? What makes you prefer one model over the other?
I prefer the model with the interaction: the interaction term is statistically significant (p = .005), and that model has the higher adjusted R-squared (.7363 versus .7169), so it represents the relationship of size and newness to price somewhat better.
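A nested-model F-test makes this comparison formal; a quick sketch assuming mainfit (part A) and newfit (part D) are still in memory:

```{r}
#does adding the interaction significantly improve fit?
anova(mainfit, newfit)
```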